Leveraging neighborhood representations of single-cell data to achieve sensitive DE testing with miloDE

Single-cell RNA-sequencing enables testing for differential expression (DE) between conditions at a cell type level. While powerful, one of the limitations of such approaches is that the sensitivity of DE testing is dictated by the sensitivity of clustering, which is often suboptimal. To overcome this, we present miloDE—a cluster-free framework for DE testing (available as an open-source R package). We illustrate the performance of miloDE on both simulated and real data. Using miloDE, we identify a transient hemogenic endothelia-like state in mouse embryos lacking Tal1 and detect distinct programs during macrophage activation in idiopathic pulmonary fibrosis. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-024-03334-3.

1. On page 4, the authors propose a 2nd order k-nearest neighbor graph representation for preserving rare cell states.Furthermore, in Supplementary Figure 2, they demonstrate how this approach achieves higher cell type purity scores across neighborhoods for rare cell states than a traditional k-nearest neighbor graph.How does this edge augmentation approach impact the purity and subsequent detection of DE genes in abundant, yet less transcriptionally distinct cell states?Given that those cellular neighborhoods are now more densely connected to one another, and there is a tradeoff between neighborhood size and homogeneity, it is recommended that the authors show cell type purity and DE detection power across all cell states using 1st and 2nd order graphs.
2. Although the authors propose a novel method for differential expression testing and highlight use cases for a neighborhood-level analysis, they provide no comparisons to existing strategies on real or simulated data.It is recommended that the authors benchmark their method against alternative strategies for DE detection (e.g.single-cell frameworks -Cacoa: Petukov et al BioRxiv 2022 and clustering-based frameworks) by measuring TPR/FPR/AUC on simulated data where the ground truth differential expression is known.This would further illustrate and supplement their claims in paragraph 2 on Page 3 in the Introduction section.Moreover, in 1. Supplementary Figure 2 depicts the average neighborhood size for 1st and second-order kNN graphs.How were the order-k values and neighborhood sizes chosen on the x-axes?How are they comparable across 1st and 2nd order graphs?2. On Page 6-7, the authors perform a series of post hoc filtering criteria to reduce the number of tests for increased statistical power and scalability of analysis.Namely, first cells are subsampled with a neighborhood refinement step to ensure 'complete coverage' of the graph.Then genes with low expression across conditions are removed.Lastly, 'uninteresting' neighborhoods are discarded.Rather than sampling cells according to the graph and then further removing redundant neighborhoods, have the authors considered performing representative downsampling or single-cell sketching prior to graph construction?For example, previous work has focused on identifying subsets of cells that preserve the original distribution of cell states (e.g.Kernel Herding sketching: Baskaran et al ACM BCB 2022) or preserving transcriptional diversity by minimizing the Hausdorff distance between sets (e.g.Hopper: DeMeo et al Bioinformatics 2020, Geometric Sketching: Hie et al Cell Systems 2019).
3. In reference to the Neighborhood Assignment and Neighborhood Refinement subsections in the Methods section, it is recommended that the authors further clarify how cells are assigned to neighborhoods.Are neighborhoods represented by the index cell, such that call ells that are connected by an edge to the index cell in the original graph are considered to be included in that neighborhood?In this way, cells can belong to multiple neighborhoods?What does it mean for a cell to be unassigned to a neighborhood?Is a new kNN graph constructed amongst index cells for DE testing? 4. On page 10 line 8, we also observe 'sign' differences -> 'significant' differences 5.The authors reference on several occasions in the methods a 'p-value' (e.g.page 25 lines 30p-value < 0.01, page 24 line 2 -p-value < 0.05, page 35 line 45 -p-value < 0.05, page 28 line 58 -p-value < 0.1).It should be clarified if this p-value is the neighborhood and gene-wise corrected p-value.How are these significance thresholds chosen?Why are they different?6.All abbreviations should be formally defined (e.g.MNN -mutual nearest neighbors on page 21 line 56, PCA -principal components analysis on page 20 line 57, kNN -k-nearest neighbor graph).

Reviewer 2
Were you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used?Yes: All analysis details are well-described.
Were you able to directly test the methods?Yes.

Comments to author:
The manuscript "Leveraging neighbourhood representations of single-cell data to achieve sensitive DE testing" proposes the method miloDE to address issues in cell-type specific differential expression testing in scRNA-seq data.Rather than relying on only imperfect celltype annotations for DE testing, miloDE constructs a cell graph and performs testing within neighbourhoods.
1.I think the method is potentially quite useful and informative.However, the first few paragraphs in the Results section are difficult to get through.Adjusting these would greatly improve understanding and flow of the manuscript.In most of the cases below, stating the purpose and then the details (instead of the other way around) would likely be a sufficient solution or consider moving some details to the Methods section.
a.The explanation of the method in the first Results paragraph is a bit too superficial given the subsequent paragraphs.Specifically, if DE is tested per neighbourhood, how exactly are the overlapping neighbourhoods being leveraged?The sentences do not connect well to assist understanding.
b.In contrast, the next two paragraphs containing supplemental notes have too much detail and without understanding the method better initially, they are too dense.These seem more appropriate in the Methods section or at least some details could be moved to Methods.For example, the sentence "As k increases, the average neighbourhood size for the 2nd-order kNN graph increases considerably faster than it does for the 1st-order, with a neighbourhood size in the low hundreds being achieved when k is between 20 and 30" -What does this mean to the reader?Is this good for the proposed method?c.Why is it important to restrict the average neighbourhood size to the low hundreds cell target?(This is noted in a paragraph in the Results section on the chimeric mouse embryo dataset.)d.The assignment is noted as being done 3 times.Is this just for the simulation or for any analysis?e.Finally, at the end of these paragraphs, it became clear that the purpose was to show that the neighbourhoods generally have high cell type purity using the 2nd-order KNN approach.Stating this initially would be more helpful.Although, it is not clear why this was done only for rare cell types?f.In the paragraph beginning "Finally, we examined how cell type purity affects the sensitivity of DE detection", there are no sentences explaining how DE is actually performed using miloDE prior to this or in the paragraph.
2. The synthetically perturbed/simulated data finds that even using a purity threshold of .1 leads to sufficient testing power.But as noted this also depends greatly on the effect size.Instead of just saying "high sensitivity", it would be helpful to add the average sensitivity to the text for a given effect size, i.e. average 80% sensitivity with log2FC > 2.
3. Any explanation for why the number of cells appears to have a larger effect than the number of biological replicates regarding DE sensitivity and specificity in Supplemental Note 2? This seems a bit surprising since miloDE creates pseudo bulks for each biological replicate when testing.(It could just be the lower figure quality in the supplement submission, the lines are a bit hard to decipher.)Also, the Wang 2019 reference treats each cell as a 'replicate' which miloDE does not, so it is not totally clear why the replicates number does not have a larger effect here.

Minor:
1.In Methods section 2.2, line 30 should read "no gene selection will [be] performed" 2. Is the "bulk DE" approach mentioned in the caption to Supplementary Figure 6, actually meant to say "pseudo-bulk DE"?
We thank each of the reviewers for their insightful comments on our work.Please find below specific responses and actions taken to address their concerns.
Black Arial corresponds to the original reviewer's comment.

Blue Arial corresponds to our response
Red italics correspond to specific changes made to the manuscript (alternatively, we refer to the section of the manuscript that has been substantially rewritten) Reviewer #1: In the manuscript, "Leveraging neighbourhood representations of single-cell data to achieve sensitive DE testing" by Missarova et al. the authors propose a cluster-free statistical framework for differential expression testing across multi-condition single-cell datasets.
Previous methods for identifying differentially expressed genes have largely relied on cell type comparisons between conditions (e.g.disease, control); however, cell type annotations are challenging to determine, thus this approach may fail to detect differential gene expression dynamics that occur at a finer or more continuous resolution.Moreover, cell types may be differentially abundant across conditions leading to false positives.To overcome these limitations, the authors present an alternative approach that performs differential expression testing by assigning cells to overlapping cellular neighborhoods on a refined '2nd order' k-nearest neighbor graph.They demonstrate how this approach, in combination with differential abundance testing, can be used to find gene expression modules associated with disease progression.Furthermore they highlight important considerations when performing differential expression vs. differential abundance testing.
Overall, this manuscript presents a novel method that is broadly applicable to the single-cell community.However, it currently lacks a rigorous comparison of its performance with respect to alternative strategies.I have a few comments that should be clarified and recommend additional analyses that may improve the interpretability of the approach.However, it currently lacks a rigorous comparison of its performance with respect to alternative strategies.I have a few comments that should be clarified and recommend additional analyses that may improve the interpretability of the approach.
We thank the reviewer for their appreciation of the potential utility of our approach.
Major comments: 1. On page 4, the authors propose a 2nd order k-nearest neighbor graph representation for preserving rare cell states.Furthermore, in Supplementary Figure 2, they demonstrate how this approach achieves higher cell type purity scores across neighborhoods for rare cell states than a traditional k-nearest neighbor graph.How does this edge augmentation approach impact the purity and subsequent detection of DE genes in abundant, yet less transcriptionally distinct cell states?Given that those cellular neighborhoods are now more densely connected to one another, and there is a tradeoff between neighborhood size and homogeneity, it is recommended that the authors show cell type purity and DE detection power across all cell states using 1st and 2nd order graphs.
We apologise for this oversight.The focus, in our original submission, on rare cell types was motivated by the intuition that these cell types will be the ones that are primarily impacted by mixing different cell types.However, we agree that the issue might also propagate more broadly and affect even abundant cell types if they extensively mix with other, transcriptionally similar cell types.Therefore, building upon the reviewer's suggestion, we extended and improved the original analysis in the following ways: a) We extended the simulations to better represent multiple different cell types, ranging from highly rare to abundant.

b)
We added a complementary metric to estimate the overall homogeneity of a given neighbourhood assignment.While the metric we focused on in our original submission -the cell type purity score -provides insight into whether any individual cell type consistently mixes with other cell types for a given neighbourhood assignment, it does not reflect the overall homogeneity across all the neighbourhoods in the assignment.To account for this, we added a second metric -relative cell type enrichment.Specifically, we first identify, for each neighbourhood, its most abundant cell type and then report the percentage of cells in the neighbourhood annotated with this cell type.We reasoned that more 'homogenous assignments' are likely to be represented by neighbourhoods with higher relative cell type enrichment (this can be compared across all neighbourhoods and per individual cell types).
The overall conclusions of the extended analysis are that: a) The distributions of cell type purity and the number of cells per neighbourhood (two factors affecting the ability to detect DE genes) are highly similar for abundant cell types between 1st and 2nd order assignments, resulting in very similar DE detection (new Supp. Fig. 2D, new Supp.Fig. 3).As previously, it differs for rare cell types, in favour of 2nd order assignments. b) The relative cell type enrichment is highly similar between 1st and 2nd order for abundant cell types and is slightly higher for rare cell types in 2nd order assignment (new Supp.Fig. 2C).
Consequently, we conclude that 2nd order assignment does not compromise the homogeneity of neighbourhoods for abundant cell types, and that it increases the power to detect DE genes for rare cell types.Specifically, for the rarest cell type -Primordial germ cells -we achieve a maximum detection of 0.78 in the 2nd-order kNN against a maximum detection of 0.63 in the 1st-order kNN.The results of this analysis are described in the updated 1st Section (Supp.Fig. 2,3).We thank the reviewer for this important suggestion.We have extended the analyses of the performance of miloDE on simulated data (2nd Section) and added a comparison with Cacoa and the pseudo-bulk approach using simulations with the known ground truth.
Since the outputs of Cacoa and miloDE are not immediately comparable (single-cell level against neighbourhood level), we had to process the outputs to achieve comparable entities.
In brief, we performed the comparison on the neighbourhood-and single cell-levels.For the neighbourhood comparison, we used the immediate outputs of miloDE; for Cacoa, for each neighbourhood, we calculated average z-scores across all cells from the neighbourhood.For the single-cell comparison, we used the immediate outputs from Cacoa, and for miloDE we 'decomposed' the outputs from miloDE to return the statistic for each cell.Specifically, we first transformed corrected (across neighbourhood) p-values into z-scores, and then, for each cell, we calculated the average z-score across all neighbourhoods to which it was assigned.The main conclusion is that Cacoa, when compared to miloDE, is more impacted by multiple testing correction as well as being more susceptible to single-cell noise.
Specifically, if a fraction of perturbed cells is less than 5%, adjusting z-scores in Cacoa eliminates any significance (AUC = 0.5) while AUCs in miloDE remain high even for the datasets with the smallest perturbed fraction (0.5%).We also note that when performing a comparison at the single-cell level, even for unadjusted z-scores from Cacoa, AUCs are higher in miloDE (average AUC = 0.93) compared to Cacoa (average AUC = 0.72).We hypothesize that this could be explained by the sensitivity of Cacoa to the inherent noise in single-cell data.
Similarly, the outputs of the pseudo-bulk approach and miloDE are not directly comparable.
To attempt to address this, we estimated the sensitivity of the pseudo-bulk approach as a binary value (significant/not significant), whereas sensitivity for miloDE was estimated as a continuous value between 0 and 1 across the neighbourhoods.Overall, as expected, the sensitivity of the pseudo-bulk approach depends, to a greater extent than miloDE, on the real fraction of DE genes.Specifically, when the fraction of perturbed cells is small (below 5%), even a high effect size can not be detected by the pseudo-bulk approach.
The corresponding analyses are described in the updated Section 2 and Supplementary Figures 5, 6, 7, 8.
Moreover, in reference to the Macrophage Activation During Idiopathic Pulmonary Fibrosis subsection in the Results section, the authors highlight how sub-cell type compositional biases might lead to false positive detection of DE genes requiring a cluster-free approach.It would be useful to compare the identification of DE genes from their cluster-free approach to a cluster-based approach We apologise for the initial lack of sufficient comparative analyses between the pseudo-bulk and miloDE approaches.To address this, we have now extended this comparison in the IPF context, focusing particularly on genes that are identified as significantly and strongly DE at the pseudo-bulk level (FDR < 0.1), but which are not significant in any miloDE neighbourhood (corrected across neighbourhoods p-value < 0.1; see the added paragraph in Section 4 for more detail).
Our analyses suggest that there are two (not mutually exclusive) reasons that could drive such results.On one hand, the likely FPs in the pseudo-bulk analyses that are driven by sporadic high expression in a limited number of cells appear to be, at least partially, not detected in miloDE.On the other hand, as discussed in the manuscript, the sensitivity of miloDE is limited by the number of cells per neighbourhood, and this can be (at least partially) rescued by aggregating more cells in the pseudo-bulk approach.Importantly, although both approaches, naturally, have some limitations we want to emphasise that miloDE enables in-depth post hoc analysis (e.g.gene clustering) that is much more challenging using pseudo-bulk approaches.
Additionally, we examined whether sub-cell type compositional biases might lead to the false positive detection of DE genes in the pseudo-bulk approach (the issue originally addressed by the reviewer).To do so, we identified two sub-cell states based on Spp1-expression, since Spp1+ has been shown to be predictive of the differential abundance (DA).Next, for each gene we estimated logFC between Spp1+ and Spp1-cells while controlling for the disease status.We refer to this metric as 'DA marking potential'.We next focused our analyses on genes that are detected as significant in the pseudo-bulk approach, but have no significantly DE neighbourhoods in miloDE.We do not observe that these genes are positively biased for 'DA marking potential'.Therefore, we suggest that at least in this particular setup (disease/cell type/dataset), DA and sub-cell type differences in composition are not a driving force for discrepancies between pseudo-bulk and miloDE.More broadly, we want to emphasise that it is possible that a gene both marks a DA sub-region(s) and is simultaneously DE between control and case conditions, either in one or several DA sub-regions.Moreover, in a disease setting, it is likely that genes that drive the disease state, will both mark DA sub-regions (e.g.disease-state sub-region and healthy-state sub-region) and DE between disease and healthy cells.
Overall, we thank the reviewer for suggesting this analysis since it made us revise the importance of sub-cell type compositional biases for false positive DE detection.We now updated the corresponding statements in the manuscript.Specifically, the updated text goes as follows: On a molecular level, one axis of macrophage variability in IPF is associated with the expression of the fibrotic marker Spp1 (Morse et al. 2019;Reyfman et al. 2019;Adams et al. 2020).The abundance of an Spp1-high subpopulation has been repeatedly observed in IPF We agree with the reviewer and apologise for the misleading wording here.While we claim that pseudo-bulk DE methods don't provide the resolution to cluster genes and study patterns of co-regulation, it is misleading to say that it is impossible to characterise continuous patterns of DE changes with other approaches, such as differential pseudotime analysis as suggested by the reviewer.To improve the clarity of our manuscript, we have updated the text accordingly (see below).
More generally, we do not consider miloDE, in its current form, to be immediately applicable to detecting continuous differential gene expression changes.Such analysis could be performed post hoc, by for example pairing miloDE outputs with pseudotime trajectory inference and studying the changes in neighbourhoods along the identified trajectories.In general, we suggest that miloDE provides a hypothesis-free approach to learn DE expression vectors in a local, heterogeneity-controlled manner, and it provides an output that can be further interpreted and explored based on the underlying hypothesis for the data in hand.With this, miloDE is not directly comparable with Lamian, and therefore we suggest that the benchmarking these two methods lies outside the scope of our study.However, we agree that differential pseudotime trajectory analysis is a highly relevant approach to studying disease-induced cell state changes and we now have incorporated the reference to Lamian in the discussion section: While miloDE approximates gene regulation as a more continuous function of the manifold, it is important to acknowledge that it leverages the same principle of cell grouping as standard per cell type approaches.Therefore, in cases where a manifold is comprised of fairly distinctive, homogenous cell types, high-resolution clustering coupled with a per bulk DE approach will likely yield very similar results to miloDE.Similarly, in the case of continuous trajectories, a per bulk approach combined with the inclusion of a continuous covariate representing a pseudotime (or, perhaps, a more complex and bespoke approach such as tradeSeq (Van den Berge et al. 2020)) might also yield similar results and resolution to miloDE.Moreover, differential pseudotime analysis such as Lamian and others can provide the insights for how certain lineages are affected, both in abundance and expression, during the disease or perturbation (Hou et al. 2023;Campbell and Yau 2018).While powerful in some cases, the trajectory-based branch of methods assumes that the manifold is continuous.In reality, a single-cell manifold is typically comprised of an intricate and complex combination of discrete and continuous cell types, and, moreover, it is often unclear whether the annotated cell types contain further heterogeneity.Combined, we suggest that miloDE allows the user to overcome laborious and bespoke analysis for each individual cell type using an elegant and straightforward approach.
4. How sensitive is Milo-DE to choice of index cell selection?Are log fold changes and corrected p-values reproducible for cells in similar neighborhoods?
We thank the reviewer for bringing up this important point.Since the neighbourhood assignment is semi-random (i.e. it relies on a random selection of cells that is then further refined to select more densely connected cells as neighbourhood centres), it is important to test for the robustness of the final DE detection with respect to this assignment.We note that we provide such testing at each aspect of the method benchmark.Specifically, a) In the analyses of the importance of the embedding, for each parameter choice, we perform an assignment several times and show that the type of embedding has a more pronounced effect than the neighbourhood assignment.

b)
In the analyses of the importance of the order choice, we performed several assignments for each parameter set, and we showed higher cell type purity (and therefore power to detect DE) for the 2nd order neighborhood approach for rare cell types across all replicates of neighbourhood assignments.c) Analogously, multiple choices of parameters for neighbourhood construction were assessed when using the simulated datasets to compare MiloDE against Cacoa and the pseudo-bulk approaches.While the stochasticity in neighbourhood assignment has a measurable effect in edge cases with a very limited number of ground truth DE neighbourhoods, overall we see robust patterns across the tested conditions.Importantly, we conclude that the neighbourhoods that have a similar fraction of 'perturbed' cells are likely to yield similar DE detection.

Milo-DE depends on a series of preprocessing steps and hyperparameters. Although the
The rationale for the choice of [order-k] stems from the analysis in Supplementary Note 2. In this analysis, we first identify that a few hundred cells is a reasonable target for the neighbourhood size in order to ensure high sensitivity.Accordingly, we select the values for [order-k] such that they return the average neighbourhood size of the selected range of few hundred cells.The feature we are comparing between the 1st-and 2nd-order settings is the number of cells per neighbourhood.We agree with the reviewer that there are several ways to approach representative sampling of cells from the population.As pointed out by the reviewer, the preservation of rare cell states is important, and in the context of miloDE, we particularly want to ensure the homogeneity of neighbourhoods that capture rare cell states.We suggest that the approach we take -sampling an excessive number of index cells to ensure that all cells are assigned followed by the neighbourhood filtering and combined with the implementation of 2nd-order kNN for better recapitulation of the local density achieves this goal.As highlighted by the reviewer, geometric sketching could have yielded a similar outcome.However, since geometric sketching is likely to greatly shift the composition of the cell types/states, with potential to impact the significance of estimates after multiple testing correction, we feel that such a comparison is beyond the scope of the current manuscript.Moreover, and importantly, as shown by our simulations, the approach we take strikes a good balance between computational complexity and sensitivity to detect DE genes.
3. In reference to the Neighborhood Assignment and Neighborhood Refinement subsections in the Methods section, it is recommended that the authors further clarify how cells are assigned to neighborhoods.Are neighborhoods represented by the index cell, such that call ells that are connected by an edge to the index cell in the original graph are considered to be included in that neighborhood?In this way, cells can belong to multiple neighborhoods?
What does it mean for a cell to be unassigned to a neighborhood?Is a new kNN graph constructed amongst index cells for DE testing?
We thank the reviewer for this suggestion.To address this, we have extended the relevant Method section accordingly.Specifically, the graph (both 1st and 2nd) is constructed using all cells, and the index cells are used to select neighbourhoods (one neighbourhood corresponds to one index cell and consists of the index cell and all its neighbours in the final graph).Cells can indeed belong to multiple neighbourhoods and we refer to this property as an overlap between the neighbourhoods.Accordingly, it is also possible that a cell does not belong to any neighbourhood and we refer to these cells as unassigned cells.
4. On page 10 line 8, we also observe 'sign' differences -> 'significant' differences We apologise for the confusion caused by bad wording here.What we intended to state is that neighbourhoods in this region have both positive and negative effect sizes (hence sign difference), and overall DA is strongly correlated with the transcriptional proximity to the blood progenitors.We have now updated the original text to achieve better clarity: In agreement with this, we observe a strong correlation between the distance to blood progenitors and enrichment and DA.Specifically, neighbourhoods that are most proximal to blood progenitors neighbourhoods are significantly enriched for Tal1+ cells, which in turn are 'preceded' by the neighbourhoods that are significantly enriched for Tal1-cells.
5. The authors reference on several occasions in the methods a 'p-value' (e.g.page 25 lines 30 -p-value < 0.01, page 24 line 2 -p-value < 0.05, page 35 line 45 -p-value < 0.05, page 28 line 58 -p-value < 0.1).It should be clarified if this p-value is the neighborhood and gene-wise corrected p-value.How are these significance thresholds chosen?Why are they different?
We thank the reviewer for pointing this out.We now ensure that we use only one, generally accepted, significance threshold for raw p-values (0.05) and we use one significance threshold for corrected, either across neighbourhoods or genes, p-values (0.1).The choice of whether to use corrected p-values (either genes or neighbourhoods) or raw p-values was driven by the specific hypothesis/question that is addressed in each analysis.Please note that in the original IPF analyses, we used a threshold of 0.05 for the corrected p-value.For consistency, we have now adjusted this to 0.1, and to prioritise gene candidates for further investigation, we have increased the cut-off for the fraction of significantly DE neighbourhoods to 0.25.Please see the updated IPF analyses in Section 4.
6.All abbreviations should be formally defined (e.g.MNN -mutual nearest neighbors on page 21 line 56, PCA -principal components analysis on page 20 line 57, kNN -k-nearest neighbor graph).
We now ensure that all abbreviations are defined prior to the usage.Specifically, • Differential expression -DE -is defined in the abstract (page 1) • single-cell RNA sequencing -scRNA-seq -is defined in the abstract (page 1) • Differential abundance -DA -is defined in the Introduction (page 2) • k-Nearest Neighbours -kNN -is defined in the Introduction (page 3) • log Fold Change -logFC -is defined in Section 1 (page 4) • Mutual nearest neighbours -MNN -is defined in Section 1 (page 5) • Primordial Germ cells -PGCs -is defined in Section 1 (page 7) • False Discovery Rate -FDR -is defined in Section 2 (page8) • Idiopathic Pulmonary Fibrosis -IPF -is defined in Section 4 (page 11) • Principal component analysis -PCA -is defined in the Methods (page 21) • Area under the curve -AUC -is defined in Section 2 (page 10) Reviewer #2: The manuscript "Leveraging neighbourhood representations of single-cell data to achieve sensitive DE testing" proposes the method miloDE to address issues in cell-type specific differential expression testing in scRNA-seq data.Rather than relying on only imperfect cell-type annotations for DE testing, miloDE constructs a cell graph and performs testing within neighbourhoods.
1.I think the method is potentially quite useful and informative.However, the first few paragraphs in the Results section are difficult to get through.Adjusting these would greatly improve understanding and flow of the manuscript.In most of the cases below, stating the purpose and then the details (instead of the other way around) would likely be a sufficient solution or consider moving some details to the Methods section.
a.The explanation of the method in the first Results paragraph is a bit too superficial given the subsequent paragraphs.Specifically, if DE is tested per neighbourhood, how exactly are the overlapping neighbourhoods being leveraged?The sentences do not connect well to assist understanding.
b.In contrast, the next two paragraphs containing supplemental notes have too much detail and without understanding the method better initially, they are too dense.These seem more appropriate in the Methods section or at least some details could be moved to Methods.
For example, the sentence "As k increases, the average neighbourhood size for the 2nd-order kNN graph increases considerably faster than it does for the 1st-order, with a neighbourhood size in the low hundreds being achieved when k is between 20 and 30" -What does this mean to the reader?Is this good for the proposed method?
c. Why is it important to restrict the average neighbourhood size to the low hundreds cell target?(This is noted in a paragraph in the Results section on the chimeric mouse embryo dataset.) d.The assignment is noted as being done 3 times.Is this just for the simulation or for any analysis?
e. Finally, at the end of these paragraphs, it became clear that the purpose was to show that the neighbourhoods generally have high cell type purity using the 2nd-order KNN approach.
Stating this initially would be more helpful.Although, it is not clear why this was done only for rare cell types?
f.In the paragraph beginning "Finally, we examined how cell type purity affects the sensitivity of DE detection", there are no sentences explaining how DE is actually performed using miloDE prior to this or in the paragraph.
We thank the reviewer for their constructive and helpful suggestions, all of which have helped to make our manuscript as clear as possible.Specifically: a) We have extended the first paragraph to give a complete overview of the method.
This will allow the reader to understand the main steps of the method and thus facilitate the understanding of the following, more fine-grained paragraphs.We have also added a more in-depth explanation of what neighbourhood assignment means from a biological perspective and how it differs from the metacell representation.

b)
We further moved some of the more methodological aspects considering DE testing itself to the methods section.As for some of the aspects considering neighbourhood assignment, on balance we feel that it is still best to keep them in the first section of the Results.Our rationale for this is that neighbourhood assignment dictates how cells will be grouped, which in turn defines the sensitivity of DE detection.Understanding how one relates to the other (both a theoretical understanding and the supporting analysis) is critical for the proper implementation of the method.By keeping this analysis in the first section of the Results we can clearly emphasise the importance of some parameters.However, to ensure better readability, we now split the discussed aspects into subsections, thus facilitating the flow of the paper.

c)
We agree with the reviewer that we do not initially provide sufficient intuition as to why we strive to restrict neighbourhood sizes until later in the manuscript.Since providing this intuition will facilitate a better understanding of the method and subsequent analysis, we now incorporate it briefly at the beginning of the relevant subsection.Specifically, we add the following statement: By design, miloDE aims to detect DE on a local level, within individual transcriptional states.
Having more cells within the neighbourhoods increases the probability of generating more heterogeneous neighbourhoods, thus hindering the ability to capture distinct transcriptional states.Accordingly, ideal neighbourhood assignments should contain neighbourhoods of the smallest possible size.On the other hand, the power to detect DE using pseudo-bulk approaches such as edgeR is highly dependent on and scales with the number of tested cells (Crowell et al. 2020).

d)
As suggested by the reviewer, we now state upfront our motivation for introducing 2nd-order kNN graphs, with further elaboration of the details: We suggest that the 2 nd -order kNN representation returns more homogenous neighbourhoods than the 1 st -order kNN representation, while controlling for average neighbourhood size.
Also, as suggested, we have extended the order analysis across multiple cell types.Since the Reviewer #1 raised similar concerns, we addressed this in more detail above (first comment from the Reviewer #1, pages 2-3).
e) Regarding comment d., a single run of miloDE generates one assignment.
Accordingly, we perform multiple assignments for some analysis to assess the robustness of the algorithm.Note, that for analysis of real-life data (Tal1-mouse chimeras and macrophages in IPF), the assignment is performed only once.We now more clearly state that the assignments are random and can change between different runs of miloDE: Additionally, a certain degree of stochasticity, owing to the random selection of index cells, invariably impacts into any neighbourhood assignment, potentially propagating into discrepancies in the DE detection.To address this, in the following simulations, for each [order-k] combination (hereafter referred to as an assignment) we repeated the assignment f) We also updated Supplementary Note 2 to accommodate the reviewer's comment b.
and set a specific neighbourhood size target driven by quantitative estimation of the sensitivity for a certain effect size.
2. The synthetically perturbed/simulated data finds that even using a purity threshold of .1 leads to sufficient testing power.But as noted this also depends greatly on the effect size.
Instead of just saying "high sensitivity", it would be helpful to add the average sensitivity to the text for a given effect size, i.e. average 80% sensitivity with log2FC > 2.
We completely agree with the reviewer that quantitative statements are more helpful and more appropriate.We have now updated the manuscript accordingly and provide the quantitative estimate to previously qualitative statements (i.e.high sensitivity).
3. Any explanation for why the number of cells appears to have a larger effect than the number of biological replicates regarding DE sensitivity and specificity in Supplemental Note 2? This seems a bit surprising since miloDE creates pseudo bulks for each biological replicate when testing.(It could just be the lower figure quality in the supplement submission, the lines are a bit hard to decipher.)Also, the Wang 2019 reference treats each cell as a 'replicate' which miloDE does not, so it is not totally clear why the replicates number does not have a larger effect here.
We thank the reviewer for bringing up this interesting point.DE detection is affected by several factors, including the number of cells per replicate (and accordingly total number of cells) as well number of replicates in both conditions and class imbalance.Overall, a very low number of cells per replicate is suboptimal, since the resulting pseudo-bulk counts are less likely to resemble bulk RNA-seq, and thus follow a negative binomial (NB) distribution, which in turn will result in a suboptimal estimation of the parameters of this distribution and DE inference (albeit the degree of uncertainty will depend on the mean and variance of gene expression).On the other hand, a higher number of replicates will enable higher statistical power to detect DE genes.Accordingly, we reason that DE detection should improve with both a higher number of cells per replicate and a higher number of replicates.With this, we want to highlight that in the original version of the manuscript, the x-axis corresponds to the total number of cells across all replicates.Accordingly, while comparing two datasets with different numbers of replicates, we simultaneously vary two parameters: the number of replicates and the average number of cells per replicate (which, by design, will be lower for datasets with a higher number of replicates).Therefore, we extended the original analysis to better compare and assess how DE detection depends on the average number of cells per replicate and the number of replicates (Supplementary Note 2, Fig. 1A, B, second and fourth columns).In this comparison, we see that both the higher average number of cells per replicate and higher number of replicates have a pronounced effect on the sensitivity of DE detection.
We also want to elaborate further on the original question regarding the comparison between the importance of the total number of tested cells and the number of replicates and provide two potential explanations for the observed dominance of the former.DE detection between two conditions relies on the absolute difference between the means of the replicates from the two conditions, as well as the variances across replicates within each condition.A smaller variance across replicates in one or both conditions provides a higher chance of significant DE detection.Since we employ a pseudo-bulk approach, the total variance for one condition is a function of variance across cells in each replicate and variance across different replicates from the condition.
More formally, for a single condition, let's denote -theoretical variance between cells σ 1 2 within each replicate (for simplicity we assume it is constant across replicates) and σ 2 2 theoretical variance across replicates.We also denote as the number of samples and   as the number of cells per sample.Following the law of total variance, the variance of   the overall mean across pseudo-bulks (where, for simplicity, we assume that the pseudo-bulk profile is the average of the cell-level profiles) can be decomposed into two terms: the variance of the expectation of means across replicates and the expectation of variances across replicates.Following the formula for variance of sample means, the former term is equal to .Since variance across replicates is variance across sample σ to different donors) is that , and, therefore, the number of replicates/donors will σ 2 2 >> σ 1 2 indeed be a dominant factor.However, in our original synthetic simulations, we set up replicates with no batch effect which potentially resulted in , in turn making the σ 1 2 >> σ 2 2. Although the authors propose a novel method for differential expression testing and highlight use cases for a neighborhood-level analysis, they provide no comparisons to existing strategies on real or simulated data.It is recommended that the authors benchmark their method against alternative strategies for DE detection (e.g.single-cell frameworks -Cacoa: Petukov et al BioRxiv 2022 and clustering-based frameworks) by measuring TPR/FPR/AUC on simulated data where the ground truth differential expression is known.This would further illustrate and supplement their claims in paragraph 2 on Page 3 in the Introduction section.

2.
On Page 6-7, the authors perform a series of post hoc filtering criteria to reduce the number of tests for increased statistical power and scalability of analysis.Namely, first cells are subsampled with a neighborhood refinement step to ensure 'complete coverage' of the graph.Then genes with low expression across conditions are removed.Lastly, 'uninteresting' neighborhoods are discarded.Rather than sampling cells according to the graph and then further removing redundant neighborhoods, have the authors considered performing representative downsampling or single-cell sketching prior to graph construction?For example, previous work has focused on identifying subsets of cells that preserve the original distribution of cell states (e.g.Kernel Herding sketching: Baskaran et al ACM BCB 2022) or preserving transcriptional diversity by minimizing the Hausdorff distance between sets (e.g.Hopper: DeMeo et al Bioinformatics 2020, Geometric Sketching: Hie et al Cell Systems 2019).
replicate across cells, the latter term can be further derived as .contribution of each term to the final variance depends on the difference between and .The expectation for the real-life data (i.e.where replicates correspond σ there are methods for detecting continuous differential gene expression patterns at the cell-level for multi-sample multi-condition datasets (e.g.Lamian: Hou et.al.Nature Communications 2023).Can Milo-DE be used for detecting continuous differential gene expression changes across case/control samples?This would be an interesting application and comparison.
3. On Page 10 lines 55-59, the authors reference how it is impossible to characterize different continuous patterns of DE changes within cell types.This statement is a bit misleading as